1. Stat405
Advanced data manipulation
Hadley Wickham
Tuesday, 28 September 2010
2. 1. Baby names data
2. Slicing and dicing revision
3. Merging data
4. Group-wise operations
3. Baby names
Top 1000 male and female baby
names in the US, from 1880 to
2008.
258,000 records (1000 * 2 * 129)
But only five variables: year,
name, soundex, sex and prop.
CC BY http://www.flickr.com/photos/the_light_show/2586781132
4. Getting started
library(plyr)
library(ggplot2)
options(stringsAsFactors = FALSE)
# Can read compressed files
bnames <- read.csv("baby-names2.csv.bz2")
# Can read files from website
births <- read.csv(
"http://had.co.nz/stat405/data/births.csv")
# Unfortunately can't do both at the same time :(
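A minimal sketch of the "can read compressed files" point, using a made-up two-row data frame and a temporary file rather than the course data. One possible workaround for remote compressed files (not shown in the slides) is to download the file first with download.file() and then read the local copy.

```r
# Hypothetical toy data, written to a bzip2-compressed CSV in a temp file
toy <- data.frame(year = c(1880, 1880), name = c("John", "Mary"))
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() detects the compression from the file contents
back <- read.csv(path, stringsAsFactors = FALSE)
```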
5. > head(bnames, 20)
   year    name soundex     prop sex
1  1880    John    J500 0.081541 boy
2  1880 William    W450 0.080511 boy
3  1880   James    J520 0.050057 boy
4  1880 Charles    C642 0.045167 boy
5  1880  George    G620 0.043292 boy
6  1880   Frank    F652 0.027380 boy
7  1880  Joseph    J210 0.022229 boy
8  1880  Thomas    T520 0.021401 boy
9  1880   Henry    H560 0.020641 boy
10 1880  Robert    R163 0.020404 boy
11 1880  Edward    E363 0.019965 boy
12 1880   Harry    H600 0.018175 boy
13 1880  Walter    W436 0.014822 boy
14 1880  Arthur    A636 0.013504 boy
15 1880    Fred    F630 0.013251 boy
16 1880  Albert    A416 0.012609 boy
17 1880  Samuel    S540 0.008648 boy
18 1880   David    D130 0.007339 boy
19 1880   Louis    L200 0.006993 boy
20 1880     Joe    J000 0.006174 boy
> tail(bnames, 20)
       year     name soundex     prop  sex
257981 2008     Miya    M000 0.000130 girl
257982 2008     Rory    R600 0.000130 girl
257983 2008  Desirae    D260 0.000130 girl
257984 2008   Kianna    K500 0.000130 girl
257985 2008   Laurel    L640 0.000130 girl
257986 2008   Neveah    N100 0.000130 girl
257987 2008   Amaris    A562 0.000129 girl
257988 2008 Hadassah    H320 0.000129 girl
257989 2008    Dania    D500 0.000129 girl
257990 2008   Hailie    H400 0.000129 girl
257991 2008   Jamiya    J500 0.000129 girl
257992 2008    Kathy    K300 0.000129 girl
257993 2008   Laylah    L400 0.000129 girl
257994 2008     Riya    R000 0.000129 girl
257995 2008     Diya    D000 0.000128 girl
257996 2008 Carleigh    C642 0.000128 girl
257997 2008    Iyana    I500 0.000128 girl
257998 2008   Kenley    K540 0.000127 girl
257999 2008   Sloane    S450 0.000127 girl
258000 2008  Elianna    E450 0.000127 girl
6. Your turn
Extract your name from the dataset. Plot
the trend over time.
What geom should you use? Do you
need any extra aesthetics?
7. hadley <- subset(bnames, name == "Hadley")
qplot(year, prop, data = hadley, colour = sex,
  geom = "line")
# :(
8. Your turn
Use the soundex variable to extract all
names that sound like yours. Plot the
trend over time.
Do you have any difficulties? Think about
grouping.
9. gabi <- subset(bnames, soundex == "G164")
qplot(year, prop, data = gabi)
qplot(year, prop, data = gabi, geom = "line")
qplot(year, prop, data = gabi, geom = "line",
colour = sex) + facet_wrap(~ name)
qplot(year, prop, data = gabi, geom = "line",
colour = sex, group = interaction(sex, name))
10. Sawtooth appearance implies grouping is incorrect.
[Figure: line plot of prop (0.001 to 0.005) against year (1880 to 2000), coloured by sex (boy, girl).]
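A small sketch of why the sawtooth appears, using a hypothetical toy data frame (the names, years and proportions are invented for illustration). When several names share a soundex, one sex can have more than one row per year, so a single line per sex zig-zags between them. Grouping by the sex and name combination leaves exactly one row per year in each group.

```r
library(ggplot2)

# Hypothetical toy data: two names sharing a soundex, one used for both sexes
toy <- data.frame(
  year = rep(c(2000, 2001, 2002), times = 3),
  name = rep(c("Gabriel", "Gabriel", "Gabrielle"), each = 3),
  sex  = rep(c("boy", "girl", "girl"), each = 3),
  prop = c(0.004, 0.005, 0.006, 0.0001, 0.0001, 0.0002, 0.001, 0.002, 0.002)
)

# Grouping by sex alone: the girl group has two rows per year (sawtooth)
rows_per_sex_year <- table(toy$sex, toy$year)

# One group per sex-name combination: one row per year within each group
toy$grp <- interaction(toy$sex, toy$name, drop = TRUE)
rows_per_group_year <- table(toy$grp, toy$year)

# Build the plot with the explicit grouping
p <- ggplot(toy, aes(year, prop, colour = sex, group = grp)) + geom_line()
```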
12. Function Package
subset base
summarise plyr
transform base
arrange plyr
They all have similar syntax. The first argument
is a data frame, and all other arguments are
interpreted in the context of that data frame.
Each returns a data frame.
13. color value color value
blue 1 blue 1
black 2 blue 3
blue 3 blue 4
blue 4
black 5
subset(df, color == "blue")
14. color value color value double
blue 1 blue 1 2
black 2 black 2 4
blue 3 blue 3 6
blue 4 blue 4 8
black 5 black 5 10
transform(df, double = 2 * value)
15. color value double
blue 1 2
black 2 4
blue 3 6
blue 4 8
black 5 10
summarise(df, double = 2 * value)
16. color value total
blue 1 15
black 2
blue 3
blue 4
black 5
summarise(df, total = sum(value))
17. color value color value
4 1 1 2
1 2 2 5
5 3 3 4
3 4 4 1
2 5 5 3
arrange(df, color)
18. color value color value
4 1 5 3
1 2 4 1
5 3 3 4
3 4 2 5
2 5 1 2
arrange(df, desc(color))
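The four functions illustrated above can be tried on a toy data frame like the one in the diagrams. This is a sketch only: the data frame df is rebuilt here from the color/value table, and the object names are chosen for illustration.

```r
library(plyr)

# Toy data frame matching the color/value table in the slides
df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5
)

blue    <- subset(df, color == "blue")        # keep rows matching a condition
doubled <- transform(df, double = 2 * value)  # add a column, keep the others
total   <- summarise(df, total = sum(value))  # collapse to a one-row summary
sorted  <- arrange(df, desc(value))           # reorder rows, largest value first
```

Each call takes the data frame first and refers to its columns by bare name, and each returns a new data frame.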
19. Your turn
Calculate the total, largest and smallest
proportions.
Reorder the data frame containing your
name from highest to lowest popularity.
20. summarise(bnames,
total = sum(prop),
largest = max(prop),
smallest = min(prop))
arrange(hadley, desc(prop))
21. Brainstorm
Thinking about the data, what are some
of the trends that you might want to
explore? What additional variables would
you need to create? What other data
sources might you want to use?
Pair up and brainstorm for 2 minutes.
22. External (join)     Internal (ddply)
Biblical names      First/last letter
Hurricanes          Length
Ethnicity           Vowels
Famous people       Rank
                    Sounds-like
24. Combining datasets
Name instrument Name band
John guitar John T
Paul bass Paul T
George guitar + George T = ?
Ringo drums Ringo T
Stuart bass Brian F
Pete drums
25. x y
Name instrument Name band Name instrument band
John guitar John T John guitar T
Paul bass Paul T Paul bass T
George guitar + George T = George guitar T
Ringo drums Ringo T Ringo drums T
Stuart bass Brian F Stuart bass NA
Pete drums Pete drums NA
join(x, y, type = "left")
26. x y
Name instrument Name band Name instrument band
John guitar John T John guitar T
Paul bass Paul T Paul bass T
George guitar + George T = George guitar T
Ringo drums Ringo T Ringo drums T
Stuart bass Brian F Brian NA F
Pete drums
join(x, y, type = "right")
27. x y
Name instrument Name band Name instrument band
John guitar John T John guitar T
Paul bass Paul T Paul bass T
George guitar + George T = George guitar T
Ringo drums Ringo T Ringo drums T
Stuart bass Brian F
Pete drums
join(x, y, type = "inner")
28. x y
Name instrument Name band Name instrument band
John guitar John T John guitar T
Paul bass Paul T Paul bass T
George guitar + George T = George guitar T
Ringo drums Ringo T Ringo drums T
Stuart bass Brian F Stuart bass NA
Pete drums Pete drums NA
Brian NA F
join(x, y, type = "full")
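The four join types illustrated above can be reproduced with the toy tables from the slides. A sketch, assuming the band column is logical (shown as T/F in the slides) and that Name is the only shared key.

```r
library(plyr)

# Toy tables rebuilt from the slides
x <- data.frame(
  Name = c("John", "Paul", "George", "Ringo", "Stuart", "Pete"),
  instrument = c("guitar", "bass", "guitar", "drums", "bass", "drums")
)
y <- data.frame(
  Name = c("John", "Paul", "George", "Ringo", "Brian"),
  band = c(TRUE, TRUE, TRUE, TRUE, FALSE)
)

left  <- join(x, y, by = "Name", type = "left")   # all of x, band is NA for Stuart and Pete
right <- join(x, y, by = "Name", type = "right")  # all of y, instrument is NA for Brian
inner <- join(x, y, by = "Name", type = "inner")  # only names present in both
full  <- join(x, y, by = "Name", type = "full")   # every name from either table
```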
29. Type Action
"left" Include all of x, and matching rows of y
"right" Include all of y, and matching rows of x
"inner" Include only rows in both x and y
"full" Include all rows
30. Your turn
Convert from proportions to absolute
numbers by combining bnames with births,
and then performing the appropriate
calculation.
31. bnames2 <- join(bnames, births,
by = c("year", "sex"))
tail(bnames2)
bnames2 <- transform(bnames2, n = prop * births)
tail(bnames2)
bnames2 <- transform(bnames2,
n = round(prop * births))
tail(bnames2)
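The same steps on a tiny example, to show what the two-key join and the rounding do. All values below are invented for illustration and are not taken from the real births data.

```r
library(plyr)

# Hypothetical proportions and birth totals (made-up numbers)
names_toy <- data.frame(
  year = c(2000, 2000, 2001),
  sex  = c("boy", "girl", "boy"),
  name = c("A", "B", "A"),
  prop = c(0.01, 0.02, 0.02)
)
births_toy <- data.frame(
  year   = c(2000, 2000, 2001, 2001),
  sex    = c("boy", "girl", "boy", "girl"),
  births = c(1000, 900, 1100, 950)
)

# Match each name row to the births total for its year and sex
joined <- join(names_toy, births_toy, by = c("year", "sex"))

# Proportion times total births, rounded to a whole number of babies
joined <- transform(joined, n = round(prop * births))
```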
32. [Figure: births (500000 to 2000000) against year (1880 to 2000), one line per sex (boy, girl). Annotations: "1936: first issued" and "1986: needed for child tax deduction".]
33. Group-wise
operations
34. Number of people
How do we compute the number of
people with each name over all years? It’s
pretty easy if you have a single name.
How would you do it?
35. hadley <- subset(bnames2, name == "Hadley")
sum(hadley$n)
# Or
summarise(hadley, n = sum(n))
# But how could we do this for every name?
36. # Split
pieces <- split(bnames2, list(bnames2$name))
# Apply
results <- vector("list", length(pieces))
for(i in seq_along(pieces)) {
piece <- pieces[[i]]
results[[i]] <- summarise(piece, n = sum(n))
}
# Combine
result <- do.call("rbind", results)
37. # Or equivalently
counts <- ddply(bnames2, "name", summarise,
n = sum(n))
38. # Or equivalently
counts <- ddply(
  bnames2,     # Input data
  "name",      # Way to split up input
  summarise,   # Function to apply to each piece
  n = sum(n))  # 2nd argument to summarise()
39. x y
a 2
a 4
b 0
b 5
c 5
c 10
40. Split
Input     Piece a   Piece b   Piece c
x y       x y       x y       x y
a 2       a 2       b 0       c 5
a 4       a 4       b 5       c 10
b 0
b 5
c 5
c 10
41. Split Apply
Input     Piece a   Piece b   Piece c
x y       x y       x y       x y
a 2       a 2       b 0       c 5
a 4       a 4       b 5       c 10
b 0
b 5
c 5
c 10
Applied   3         2.5       7.5
42. Split Apply Combine
Input     Piece a   Piece b   Piece c
x y       x y       x y       x y
a 2       a 2       b 0       c 5
a 4       a 4       b 5       c 10
b 0
b 5
c 5
c 10
Applied   3         2.5       7.5
Combined
x y
a 3
b 2.5
c 7.5
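The split, apply and combine steps in the diagrams can be sketched on the same six-row table. The per-piece results shown (3, 2.5, 7.5) are consistent with taking the mean of y within each piece, so that is the function assumed here. The manual loop and the ddply call should give the same answer.

```r
library(plyr)

# Toy table from the slides
df <- data.frame(
  x = c("a", "a", "b", "b", "c", "c"),
  y = c(2, 4, 0, 5, 5, 10)
)

# Split: one data frame per value of x
pieces <- split(df, df$x)

# Apply: summarise each piece (assumed here to be the mean of y)
results <- vector("list", length(pieces))
for (i in seq_along(pieces)) {
  results[[i]] <- data.frame(x = names(pieces)[i], y = mean(pieces[[i]]$y))
}

# Combine: stack the per-piece results into one data frame
by_hand <- do.call("rbind", results)

# The same three steps in a single ddply call
by_ddply <- ddply(df, "x", summarise, y = mean(y))
```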
43. Your turn
Repeat the same operation, but use
soundex instead of name. What is the
most common sound? What name does
it correspond to?
44. scounts <- ddply(bnames2, "soundex", summarise,
n = sum(n))
scounts <- arrange(scounts, desc(n))
# Combine with names
# When there are multiple possible matches,
# match = "first" makes join pick the first
# (recent plyr versions default to match = "all")
scounts <- join(
  scounts, bnames2[, c("soundex", "name")],
  by = "soundex", match = "first")
head(scounts, 100)
subset(bnames, soundex == "L600")
45. # Alternative approach that you'll learn more
# about on Thursday
library(stringr)
scounts <- ddply(bnames2, "soundex", summarise,
n = sum(n),
names = str_c(sort(unique(name)), collapse = ","))
scounts <- arrange(scounts, desc(n))
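To see what the names column ends up holding, here is the same call on a hypothetical five-row table (the counts are invented). Within each soundex group, str_c() with collapse pastes the sorted unique names into one comma-separated string.

```r
library(plyr)
library(stringr)

# Hypothetical toy data: three names sharing soundex L600, plus one other
toy <- data.frame(
  soundex = c("L600", "L600", "L600", "L600", "J500"),
  name    = c("Lori", "Larry", "Lori", "Laura", "John"),
  n       = c(5, 7, 3, 2, 10)
)

# Total count per sound, plus every distinct name with that sound
out <- ddply(toy, "soundex", summarise,
  n = sum(n),
  names = str_c(sort(unique(name)), collapse = ","))
out <- arrange(out, desc(n))
```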